{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mahalanobis Outlier Algorithm Documentation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aim of the document is to explain and document the algorithm used in Seldon's Mahalanobis Online Outlier Detector.\n",
    "\n",
    "In the first part we give a high level overview of the algorithm, then we explain the implementation in detail."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output of the algorithm (outlier score) is a measure of distance from the center of the features distribution (Mahalanobis distance). The algorithm is online, which means that it starts without knowledge about the distribution of the features and learns as requests arrive. Consequently you should expect the output to be bad at the start and to improve over time. \n",
    "\n",
    "The output being a real positive number, we leave it to the user to decide on a threshold for when a point will be considered to be an outlier.\n",
    "\n",
    "As observations arrive, the algorithm will:\n",
    "- Keep track and update the mean and sample covariance matrix of the dataset\n",
    "- Apply a principal component analysis using these moments and project the new observations on the first 3 principal components (default value, can be changed)\n",
    "- Compute the Mahalanobis distance from these projections to the projected mean\n"
   ]
  },
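  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the interface concrete, here is a minimal usage sketch. It is not part of the Seldon source: the batch array is random placeholder data and the threshold choice is only one possibility. Since the returned scores are squared Mahalanobis distances (see the last section), under a Gaussian assumption they are approximately chi-squared distributed with n_components degrees of freedom, which gives one principled way to pick a threshold.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.stats import chi2\n",
    "\n",
    "detector = OutlierMahalanobis(n_components=3)\n",
    "threshold = chi2.ppf(0.99, df=3)  # flag roughly the most extreme 1% of points\n",
    "\n",
    "batch = np.random.randn(50, 10)  # placeholder batch: 50 observations, 10 features\n",
    "scores = detector.score(batch, feature_names=None)  # feature_names is unused in the code shown here\n",
    "outliers = np.where(scores > threshold)[0]  # indices of the flagged points\n",
    "```"
   ]
  },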
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The core of the algorithm is using efficient and numerically stable streaming techniques to keep track of the mean and covariance matrix as new points are observed. The PCA is done by finding the eigenvectors of the covariance matrix using a function implemented in scipy. We also use an efficient algorithm to avoid inverting a new covariance matrix for each point in the batch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Object state"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The outlier detector class has 4 attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "class OutlierMahalanobis(object):\n",
    "    def __init__(self,n_components=3,max_n=None):\n",
    "        self.mean = 0\n",
    "        self.C = 0\n",
    "        self.n = 0\n",
    "        self.n_components = n_components\n",
    "        self.max_n = max_n\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- mean: Online mean of the observed values\n",
    "- C: Online covariance matrix of the observed values\n",
    "- n: number of observations so far\n",
    "- n_components: number of principal components (default = 3)\n",
    "- max_n: Used to make the algorithm non stationary by capping the number of observations to max_n. When specified, the algorithm will behave like if it had seen at most max_n points, thus adapting faster to changes in the underlying distribution. Turned off by default."
   ]
  },
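  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why capping helps: in the mean update, a single new observation receives weight $\\frac{1}{n+1}$, which decays to zero as $n$ grows, so the estimates eventually stop moving. Capping $n$ at max_n keeps this weight bounded below by 1/(max_n + 1). A small illustration (the values are arbitrary):\n",
    "\n",
    "```python\n",
    "max_n = 100\n",
    "for n in [10, 100, 1000, 10000]:\n",
    "    n_eff = min(n, max_n)  # the capped count used by the algorithm\n",
    "    print(n, 1./(n + 1.), 1./(n_eff + 1.))  # uncapped vs capped weight of a new point\n",
    "```"
   ]
  },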
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First Step: Tracking the mean and covariance matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def score(self,features,feature_names):\n",
    "\n",
    "    nb = features.shape[0] # batch size\n",
    "    p = features.shape[1] # number of features\n",
    "    n_components = min(self.n_components,p)\n",
    "    if self.max_n is not None:\n",
    "        n = min(self.n,self.max_n) # n can never be above max_n\n",
    "    else:\n",
    "        n = self.n\n",
    "\n",
    "    # Tracking the mean and covariance matrix\n",
    "    roll_partial_means = features.cumsum(axis=0)/(np.arange(nb)+1).reshape((nb,1))\n",
    "    coefs = (np.arange(nb)+1.)/(np.arange(nb)+n+1.)\n",
    "    new_means = self.mean + coefs.reshape((nb,1))*(roll_partial_means-self.mean)\n",
    "    new_means_offset = np.empty_like(new_means)\n",
    "    new_means_offset[0] = self.mean\n",
    "    new_means_offset[1:] = new_means[:-1]\n",
    "\n",
    "    coefs = ((n+np.arange(nb))/(n+np.arange(nb)+1.)).reshape((nb,1,1))\n",
    "    B = coefs*np.matmul((features - new_means_offset)[:,:,None],(features - new_means_offset)[:,None,:])\n",
    "    cov_batch = (n-1.)/(n+nb-1.)*self.C + 1./(n+nb-1.)*B.sum(axis=0)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we have implemented a numerically stable algorithm for updating the mean and covariance matrix based on the following formulas.\n",
    "\n",
    "**Batch Online Mean**\n",
    "\n",
    "Let $\\bar{X}_n = \\frac{1}{n} \\sum_{k=1}^n{X_k} $ the rolling mean of $ (X_n)_n $\n",
    "\n",
    "Let $\\bar{X}_{n,N} = \\frac{1}{N-n} \\sum_{k=n+1}^N{X_k} $ the batch mean of $X_n$ between $n$ and $N$\n",
    "\n",
    "Then we have: \n",
    "\n",
    "$ \\bar{X}_{n+b} = \\bar{X}_n + \\frac{b}{n+b}(\\bar{X}_{n,n+b} - \\bar{X}_n) $ $ (1) $\n",
    "\n",
    "**Batch Online Covariance Matrix**\n",
    "\n",
    "Let $C_n = \\frac{1}{n-1} \\sum_{k=1}^n{(X_k - \\bar{X}_n)(X_k - \\bar{X}_n)^t} $ the rolling sample covariance matrix of $ (X_n)_n $\n",
    "\n",
    "Then we have:\n",
    "\n",
    "$ C_{n+b} = \\frac{n-1}{n+b-1}C_{n} + \\frac{1}{n+b-1}\\sum^{b-1}_{i=0}\\frac{n+i}{n+i+1}(X_{n+i+1}-\\bar{X}_{n+i})(X_{n+i+1}-\\bar{X}_{n+i})^t $ $ (2) $\n"
   ]
  },
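  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (not part of the Seldon source), the sketch below verifies formulas $(1)$ and $(2)$ against NumPy's batch estimators on random data:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "X = np.random.randn(100, 5)\n",
    "n, b = 60, 40  # points already seen, size of the new batch\n",
    "\n",
    "# Ground truth computed on all n + b points at once\n",
    "mean_full = X.mean(axis=0)\n",
    "cov_full = np.cov(X, rowvar=False)  # sample covariance, denominator n + b - 1\n",
    "\n",
    "# Formula (1): update the rolling mean with the batch mean\n",
    "mean_n = X[:n].mean(axis=0)\n",
    "batch_mean = X[n:].mean(axis=0)\n",
    "mean_updated = mean_n + b/float(n + b)*(batch_mean - mean_n)\n",
    "assert np.allclose(mean_updated, mean_full)\n",
    "\n",
    "# Formula (2): accumulate the weighted rank-one updates\n",
    "cov_n = np.cov(X[:n], rowvar=False)\n",
    "running_mean = mean_n.copy()\n",
    "S = np.zeros((5, 5))\n",
    "for i in range(b):\n",
    "    d = X[n + i] - running_mean\n",
    "    S += (n + i)/(n + i + 1.)*np.outer(d, d)\n",
    "    running_mean += d/(n + i + 1.)  # online mean update, one point at a time\n",
    "cov_updated = (n - 1.)/(n + b - 1.)*cov_n + S/(n + b - 1.)\n",
    "assert np.allclose(cov_updated, cov_full)\n",
    "```"
   ]
  },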
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second Step: PCA and projection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "    # PCA\n",
    "    eigvals, eigvects = eigh(cov_batch,eigvals=(p-n_components,p-1))\n",
    "\n",
    "    # Projections\n",
    "    proj_features = np.matmul(features,eigvects)\n",
    "    proj_means = np.matmul(new_means_offset,eigvects)\n",
    "    if type(self.C) == int and self.C == 0:\n",
    "        proj_cov = np.diag(np.zeros(n_components))\n",
    "    else:\n",
    "        proj_cov = np.matmul(eigvects.transpose(),np.matmul(self.C,eigvects))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We compute the first principal components: these are the eigenvectors of the sample covariance matrix associated to the largest eigenvalues. For this we use the function [eigh](https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eigh.html) from ```scipy.linalg```.\n",
    "\n",
    "We then project the new batch of features, the rolling means and the previous covariance matrix on the principal components subspace.\n"
   ]
  },
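  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that ```eigh``` returns eigenvalues in ascending order, which is why the index range ```(p - n_components, p - 1)``` selects the largest ones. A small standalone illustration follows; the ```eigvals``` keyword matches the code above, though newer SciPy releases expose the same selection as ```subset_by_index```.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.linalg import eigh\n",
    "\n",
    "np.random.seed(1)\n",
    "A = np.random.randn(5, 5)\n",
    "cov = np.dot(A, A.T)  # a symmetric positive semi-definite matrix, p = 5\n",
    "\n",
    "# The index range (p - k, p - 1) picks out the k largest eigenpairs.\n",
    "p, k = 5, 3\n",
    "eigvals, eigvects = eigh(cov, eigvals=(p - k, p - 1))\n",
    "print(eigvals)         # the three largest eigenvalues, in ascending order\n",
    "print(eigvects.shape)  # (5, 3): one column per principal component\n",
    "```"
   ]
  },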
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Third Step: Outlier detection in the Principal Components Subspace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Substep: Fast computation of the inverses of the covariance matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To compute the outlier score of each point in the new batch, we need the inverse of the covariance matrix of all the points up to this one. This means inverting $b$ matrices. We made this operation faster by leveraging the fact that each covariance matrix is a rank one update of the previous one. \n",
    "Knowing this we can use the following theorem.\n",
    "\n",
    "**Theorem:**\n",
    "\n",
    "if $A$ and $A+B$ are invertible and $rank(B) = 1$ then\n",
    "\n",
    "$ (A + B)^{-1} = A^{-1} - \\frac{1}{1+trace(BA^{-1})}A^{-1}BA^{-1} $\n",
    "\n",
    "The implementation is:\n",
    "\n",
    "```python\n",
    "    coefs = (1./(n+np.arange(nb)+1.)).reshape((nb,1,1))\n",
    "    B = coefs*np.matmul((proj_features - proj_means)[:,:,None],(proj_features - proj_means)[:,None,:])\n",
    "\n",
    "    all_C_inv = np.zeros_like(B)\n",
    "    c_inv = None\n",
    "    _EPSILON = 1e-8\n",
    "\n",
    "    for i, b in enumerate(B):\n",
    "        if c_inv is None:\n",
    "            if abs(np.linalg.det(proj_cov)) > _EPSILON:\n",
    "                c_inv = np.linalg.inv(proj_cov)\n",
    "                all_C_inv[i] = c_inv\n",
    "                continue\n",
    "            else:\n",
    "                if n + i == 0:\n",
    "                    continue\n",
    "                proj_cov = (n + i -1. )/(n + i)*proj_cov + b\n",
    "                continue\n",
    "        else:\n",
    "            c_inv = (n + i - 1.)/float(n + i - 2.)*all_C_inv[i-1]\n",
    "        BC1 = np.matmul(B[i-1],c_inv)\n",
    "        all_C_inv[i] = c_inv - 1./(1.+np.trace(BC1))*np.matmul(c_inv,BC1)\n",
    "```"
   ]
  },
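  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The theorem is the trace form of the Sherman-Morrison formula for rank-one updates. A quick numerical check (not part of the Seldon source):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(2)\n",
    "M = np.random.randn(3, 3)\n",
    "A = np.dot(M, M.T) + 3.*np.eye(3)  # symmetric positive definite, hence invertible\n",
    "u = np.random.randn(3, 1)\n",
    "B = np.dot(u, u.T)                 # rank one; A + B is still positive definite\n",
    "\n",
    "A_inv = np.linalg.inv(A)\n",
    "lhs = np.linalg.inv(A + B)\n",
    "rhs = A_inv - 1./(1. + np.trace(np.dot(B, A_inv)))*np.dot(A_inv, np.dot(B, A_inv))\n",
    "assert np.allclose(lhs, rhs)\n",
    "```"
   ]
  },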
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finally, we update the attributes and return the outlier scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "    self.mean = new_means[-1]\n",
    "    self.C = cov_batch\n",
    "    self.n += nb\n",
    "\n",
    "    feat_diff = proj_features-proj_means\n",
    "    outlier_scores = np.matmul(feat_diff[:,None,:],np.matmul(all_C_inv,feat_diff[:,:,None])).reshape(nb)\n",
    "\n",
    "    return outlier_scores\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The outlier score is the Mahalanobis distance in the Principal Components Subspace.\n",
    "\n",
    "$ score_n = (X_n - \\bar{X}_{n-1})^tC_{n-1}^{-1}(X_n - \\bar{X}_{n-1}) $"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
